male and female. Three training passes had a slightly

I. INTRODUCTION

A.   BACKGROUND 

1. Voice Technology

"It  is  only  a  matter  of  time  until  automatic  speech 
recognition  (ASR)  becomes  a  major  force  in  man-machine 
communication  because  of  the  inherent  advantages  of 
speech communication and our increasing need to communicate
with machines. The inherent advantages of speech
arise  from  its  universality,  convenience,  and  speed." 
[Ref.  1]. 

Speech  is  the  human's  fastest  and  most  convenient 
method  of  communicating  and  consequently  little  or  no 
operator training is required if speech is used as the interface
between man and computer. In experiments involving
speech  and  other  forms  of  machine  communication  (e.g., 
typing), information is exchanged almost twice as fast with
speech  [Ref.  2].   In  addition  to  the  speed  and  ease  of 
training,  speech  input  frees  the  operators'  hands  and  eyes 
for other tasks [Ref. 3].

The  use  of  voice  input  to  machines  can  be  categorized 
into  three  modes  of  operation: 

--  voice  response. 

--  speaker  verification. 

--  speech  recognition. 

VOICE  RESPONSE  is  the  area  of  voice  input  which  deals 
with  speech  synthesis  --  voice  readout  of  computer-stored 
data.   The  appropriate  message  is  selected  from  a  stored 


vocabulary  by  a  synthesis  program  and  then  given  to  a 
synthesizer  device  which  generates  a  signal  for  transmission 
over  a  voice  circuit  [Ref.  4]. 

SPEAKER  VERIFICATION  involves  authenticating  the 
identity  of  a  speaker  according  to  measurements  on  his  voice 
signal.   Applications  for  speaker  verification  systems 
include  voice  lock/unlock  security  systems  and  banking  and 
credit  transaction  [Ref.  5]. 

SPEECH  RECOGNITION  is  giving  commands  to  machines 
by  voice.   The  machine  does  not  have  to  identify  the  speaker, 
only  "recognize"  what  is  said.   The  commands  can  be  given 
by  any  speaker  as  long  as  his  or  her  voice  patterns  match 
those  parameters  for  the  desired  stored  command.   Speech 
recognition  systems  are  used  for  baggage  and  parcel  sorting, 
quality  control  on  production  lines  and  voice  direction  of 
machine  tools.   They  are  typified  by  small  word  vocabularies 
spoken  by  a  small  population  of  users  or  large  vocabularies 
(several hundred words) for speakers who allow the machine
to calibrate their voices [Ref. 6].

The  first  experiments  with  speech  input  to  machines 
were done in the 1950s using vowel and digit recognition
systems.   Today  there  are  commercially  available  isolated 
word  recognition  systems  which  easily  handle  small  vocabularies 
from  a  known  set  of  speakers.   Actual  systems  in  use  today 
include United Air Lines' baggage handling system, Ford
Motor Company's assembly line inspection of cars, Union
Carbide's nuclear products manipulation system at Oak Ridge
and Lockheed's quality control inspection line in Sunnyvale,
California.

There are two features which characterize the
complexity of the speech recognition task:

-- whether the speech is connected or spoken one word at
a time.

--  the  size  of  the  vocabulary. 

In connected speech the acoustic characteristics of sounds
and words have greater variability. In addition, it is
difficult to determine where one word ends and the next
begins. As the number of words in the vocabulary and the
number of different contextual variations per word increase,
the storage required for all reference patterns becomes
enormous.

The principal difficulty in automatic speech recognition
is not due to a lack of speech understanding but to
the massive amount of memory and time required to store and
process the required data. Recent progress has been limited
more by advances in data processing than in speech recognition
technology [Ref. 7].

Therefore, a major disadvantage of speech recognition
systems is the requirement for large amounts of memory and
processing time. Some additional problems are:

-- speaker variability due to sex and dialect makes
recognition very difficult.

-- speech communication is not private.

--  speech  communication  may  be  subject  to  environmental 
noise  and  distortions. 

--  voice  input  is  expensive  in  comparison  to  other 
input/output  devices.   (The  cost  of  voice  input 
devices  ranges  from  $200  to  $80,000  which  includes 
a  wide  variety  of  capabilities.) 

In spite of these restrictions, applications for
voice systems today include several areas:

a. voice readout of numerals.

(1) telephone numbers.

(2) assembly of equipment.

(3) stock price quotations.

(4) inventory reporting.

(5) automatic directory assistance.

b. industrial applications.

(1) special purpose computer programming for machine
tools.

(2) quality control inspection systems.

(3) equipment handling and sorting systems.

c. editing of financial information.

This  thesis  will  address  another  application  for  today's 
voice  recognition  systems  --  that  of  command  and  control.   The 
implication  here  is  not  command  and  control  in  the  sense  of 
voice communication with machines but in the military application
of a management information system which provides
data  on  resources  available. 

2. Command, Control and Communications (C3)

In  1972  the  Honeywell  6000  computer  (H6000)  was 
installed  at  Commander  in  Chief  Naval  Forces  Europe 
(CINCUSNAVEUR) in support of the World Wide Military
Command and Control System (WWMCCS). The H6000 transferred
CINCUSNAVEUR from the first generation of computer systems --
characterized by card decks and single job processing -- to
the  third  generation  of  multiprogramming,  timesharing  and 
terminal  input/output.   What  existed  at  CINCUSNAVEUR  in 
the  way  of  "computer  support"  prior  to  the  H6000  was  a  very 
"user  unfriendly"  ANYUK  computer  which  required  a  great 
deal  of  expertise  and  very  specific  procedures  to  operate. 

Consequently,  when  the  H6000  was  installed,  the  staff, 
conditioned  by  the  difficulties  of  using  the  prior  data 
processing  equipment,  was  very  reluctant  to  have  a  computer 
replace  their  filing  cabinets.   After  several  years  of 
software  changes,  updates  to  the  Navy  WWMCCS  Software 
Standardization  System  (NWSS)  were  being  passed  from  the 
fleet  by  AUTODIN  to  the  H6000.   Messages  were  not  manually 
manipulated  unless  they  were  kicked  out  of  the  system  because 
of errors.

In  spite  of  the  fact  that  inputs  to  the  database  were 
being  electrically  transmitted  from  AUTODIN  to  the  H6000 
before  the  communication  center  could  distribute  the  paper 
copy,  the  staff,  for  the  most  part,  avoided  the  NWSS  query 
module and held to their filing cabinets. Training sessions
given by the software developers on how to use NWSS were not
well attended. User reaction to the system was so negative
that a separate shop for monitoring the database and correcting
the error messages had to be formed using ADP resources.
That is, the users who were supposed to be responsible for
data content passed the responsibility off to the data
processors.

In  1978,  a  preliminary  evaluation  of  the  man-machine 
interface  of  the  NWSS  query  module  was  done  by  Naval  Ocean 
Systems Center [Ref. 8]. The reason for the study was to
investigate  the  possibility  of  simplifying  the  query  module 
since  the  module,  while  it  is  very  powerful,  is  also  rather 
confusing  to  the  infrequent  user.   There  are  nonstructured 
query  systems  being  tested  on  data  bases  similar  to  NWSS  -- 
LADDER,  for  example  --  which  would  provide  the  user  with 
a  much  easier  access  to  the  data.   LADDER  (Language  Access 
to  Distributed  Data  Bases  with  Error  Recovery)  will  allow  a 
user to ask the computer a question in plain English ("Where
is the Kennedy?") instead of requiring a specific format and
specific  command  words.   The  free  format  LADDER  query  system 
has  been  in  test  and  development  status  since  1977. 

But  let's  take  it  a  step  further.   Even  if  a 
relatively  free  format  query  system  was  available  from  NWSS, 
chances  are  a  good  percentage  of  the  staff  would  still  not 
be interested -- because it still requires the user to sit
in  front  of  a  terminal  and  find  characters  which  are 
randomly  spread  over  the  keyboard.   (Would  Star  Trek  ever 
have  been  so  popular  if  Captain  Kirk  had  to  wheel  up  to 
a  keyboard  and  begin  typing  instead  of  just  facing  the  panel 
and  speaking  into  it?)   If  using  the  NWSS  query  module  was 
as  easy  as  loading  a  tape  of  voice  patterns  and  "speaking" 
the  query  to  the  computer,  would  there  be  less  reluctance 
on  the  part  of  the  staff  and  command  center  team  to  use  the 
automated  data  base  instead  of  going  to  the  files? 

The  problem  of  C3  today  is  significantly  more  complex 
than  at  any  time  in  the  past.   To  be  competitive  in  today's 
automated world, some extension of man's memory and computational
abilities is needed. How can this capability be
provided  without  requiring  an  excessive  amount  of  training? 
Is  it  possible  to  provide  a  computer  tool  without  requiring 
typing  skills  to  use  it? 

The easier it is to access the data, the more likely
the staffer will be to use it. The easiest way for a non-data
processor to interface with a computer is simply to talk to
it. Consideration for the use of a voice interface with the
automated information system would include such questions
as:

Is  it  feasible  to  utilize  a  voice  recognition  system 
in  an  environment  such  as  a  command  center  where  each 
member  of  the  watch  team  could  query  the  computer  by 
voice? 
Is it cost effective to train a military member
to  use  a  voice  recognition  system  and  could  it  be 
done  in  a  negligible  amount  of  time? 

Would  voice  input  in  terms  of  today's  technology 
be  adaptable  for  female  as  well  as  male  usage? 

What  are  the  tradeoffs  in  using  three,  five  or 
ten  training  passes  in  terms  of  training  time,  error 
rates  and  user  psychology? 

Would  it  be  feasible  in  terms  of  system  resources 
to store voice patterns for every member of the watch
section  on  the  computer? 

Would  stress  vary  the  voice  patterns  to  such  an 
extent  that  the  voice  input  system  would  be  unacceptable 
in  the  varying  stress  situations  of  the  command  center 
environment? 

With  these  thoughts  in  mind,  this  thesis  investigates  the 
use  of  a  voice  recognition  system  by  military  operators  -- 
male, female, officer, enlisted -- from technical and
non-technical backgrounds.

B.   OBJECTIVES 

The objective of this thesis was to explore the use of
a voice recognition system by a random sample of active duty
military personnel. Specifically, it sought to determine the
effectiveness of such a system in each of the following three cases:

1. Male operators versus female operators:

The  female  voice  generally  has  a  higher  pitch  than 
the  male  voice  due  to  the  spread  of  the  harmonics  in 
the  frequency  spectrums  of  the  female.   This  factor 
causes problems in frequency resolution and consequently
the female voice has been particularly hard
for  machines  to  recognize  [Ref.  9].   There  has  been 
very  little  work  done  with  female  subjects  and  voice 
recognition  systems.   Any  system  to  be  used  in  a 
command  center  environment  will  more  than  likely  have 
female as well as male operators. Thus, one of the
main objectives of this study was to compare the
error rates of the machine using operators of
both sexes.

2.  Officer  operators  versus  enlisted  operators: 

Another  group  of  subjects  that  has  had  little 
documented  experience  with  the  voice  recognition 
system  is  that  of  enlisted  personnel.   Seemingly, 
there  should  be  no  difference  between  officer  and 
enlisted.   However,  this  assumption  has  not  been 
tested.   The  likely  candidate  for  use  of  the  voice 
recognition  system  in  the  command  center  environment 
would  be  the  enlisted  member  of  the  watch  team. 
(Hopefully,  the  ease  of  use  introduced  by  voice  access 
would  change  this!)   The  emphasis  in  this  study  was 
in  the  use  of  operational  personnel.   The  intent  was 
to  be  realistic  in  the  experience  levels  of  the 
proposed  operators  in  order  to  provide  a  true  picture 
of  the  adaptability  of  the  operators  to  the  equipment 
and  the  training  required  for  them  to  use  the 
equipment. 

3.  Three,  five,  or  ten  training  passes  to  train  the 
voice  recognition  system: 

The  accepted  algorithm  used  to  train  the  voice 
recognition  system  in  this  experiment  requires  ten 
training  passes  to  "learn"  to  recognize  the  operator's 
utterance.   In  an  extensive  vocabulary  this  can  demand 
a  considerable  amount  of  time  and  can  conceivably 
introduce  errors  in  the  training  process  if  boredom 
and/or  fatigue  take  over.   There  is  an  algorithm 
available  to  train  using  five  or  three  utterances  as 
well  as  ten.   The  final  area  examined  was  the  use 
of  three  or  five  training  passes  vice  ten. 

II. METHOD

A.  DESIGN 

Figure  1  shows  the  conceptual  design  for  this  experiment. 
It is a three-way nested hierarchical analysis of variance.
Each  of  the  four  groups  --  male  enlisted,  male  officer, 
female  enlisted,  female  officer  --  consists  of  ten  subjects. 
Each  subject  trained  and  tested  the  voice  recognition  system 
using  three,  five  and  ten  training  passes  in  a  random  order. 
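
The layout described above can be enumerated to check how many observations and degrees of freedom it implies. The following is a minimal Python sketch, not part of the original analysis. The names design_cells and degrees_of_freedom are illustrative, and the partition of the error terms assumes a conventional mixed design with sex and rank as between-subject factors and number of training passes as the within-subject factor.

```python
from itertools import product

SEXES = ("male", "female")
RANKS = ("enlisted", "officer")
SUBJECTS_PER_GROUP = 10
TRAINING_PASSES = (3, 5, 10)


def design_cells():
    """Return one (sex, rank, subject, passes) tuple per observation."""
    return [
        (sex, rank, subject, passes)
        for sex, rank in product(SEXES, RANKS)
        for subject in range(SUBJECTS_PER_GROUP)
        for passes in TRAINING_PASSES
    ]


def degrees_of_freedom():
    """Degrees of freedom under the assumed mixed-design partition."""
    groups = len(SEXES) * len(RANKS)          # four sex-by-rank groups
    subjects = groups * SUBJECTS_PER_GROUP    # forty subjects
    levels = len(TRAINING_PASSES)             # three training conditions
    observations = subjects * levels
    return {
        "total": observations - 1,
        "between_subjects": subjects - 1,
        "error_between": subjects - groups,
        "within_subjects": subjects * (levels - 1),
        "error_within": (subjects - groups) * (levels - 1),
    }
```

Under these assumptions the counts agree with the degrees of freedom listed in Table I: 119 in total, 39 between subjects and 80 within subjects.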

B.  SUBJECTS 

Forty  active  duty  military  volunteers  participated  in 
this  study.   There  were  ten  female  officers,  ten  female 
enlisted,  ten  male  officers  and  ten  male  enlisted. 

The  enlisted  subjects  were  all  Navy  members  stationed 
at  the  Naval  Postgraduate  School.   Their  ranks  ranged  from 
E1 to E8. Their rates were: Religious Program Specialist,
Yeoman, Personnelman, Mess Management Specialist, Intelligence
Specialist, Data Processor, Storekeeper, Air Intercept
Controller, Electronics Technician (including fire control
specialist).

The  officers  were  from  three  U.S.  services  --  Navy,  Army, 
Air  Force  --  and  the  Canadian  Forces.   They  ranged  in  grade 
from O3 to O5. All but two were NPS students in the C3,
Operations Research, Telecommunications Management,
Intelligence, Personnel Management and Communications
Engineering  curricula.   The  other  two  were  an  Army  chemical 
officer  from  Fort  Ord  and  an  Air  Force  navigator  stationed 
at  the  Joint  Chiefs  of  Staff.   The  backgrounds  of  the  officers 
were:   special  warfare,  National  Oceanic  and  Atmospheric 
Administration, ADP, intelligence, telecommunications,
cryptology,  acquisition,  aviator,  aerospace  engineering, 
management  analysis  and  communications. 

Based  on  a  questionnaire  given  to  each  subject  before 
performing  the  exercise,  all  but  four  thought  voice  input 
would  be  easier  and  less  frustrating  than  typing  as  a  means 
of  input  to  the  computer.   Sixteen  of  the  forty  subjects 
had  used  or  seen  voice  input  used  but  only  two  had  more 
than  an  introduction  to  voice  response  systems. 

C.   EQUIPMENT 

The  equipment  used  in  this  research  was  a  Threshold 
Technology,  Incorporated,  Model  T600  discrete  utterance 
voice  recognition  system  which  was  located  inside  an 
Industrial  Acoustic  Company  sound  reduction  chamber.   The 
microphone  used  was  a  Shure  SM10  head  microphone. 

The Model T600 consists of four basic components (see
Figure 2):

-- preprocessor unit consisting of an analog speech
preprocessor and a digital input/output interface.

-- operator console/microphone preamplifier.

-- tape cartridge unit.

-- CRT display and console.

The preprocessor accepts the speech from the microphone
preamplifier,  extracts  speech  parameters  and  converts  these 
to  digital  signals  which  are  processed  by  the  microcomputer. 
The  microcomputer  compares  the  input  signals  with  stored 
reference patterns to determine which, if any, of the vocabulary
words were spoken. If a close match is found between
the  input  speech  pattern  and  one  of  the  reference  patterns, 
a  user  defined  character  string  is  sent  to  the  user's  device 
via  the  output  interface.   If  no  match  is  found  the  system 
emits  a  "beep"  sound. 

The  reference  patterns  are  generated  during  the  "training 
mode" which requires a speaker to give several repetitions
of each utterance with a variety of inflections as would be
used in normal speech. The number of repetitions required
is usually ten but for this experiment additional logic was
added to the T600 to allow the use of three or five repetitions.
An utterance can be a single word ("grid") or group
of  words  ("command  and  control")  lasting  from  a  tenth  of  a 
second  to  two  seconds.   The  only  requirement  is  that  the 
utterance  contain  no  pauses  of  a  tenth  of  a  second  or 
greater.   If  a  tenth  of  a  second  pause  is  made,  the  T600 
will  treat  the  sound  as  two  utterances  instead  of  the  intended 
one. Up to 256 utterances are allowed on this system [Ref. 10].
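
The pause rule can be illustrated with a short Python sketch. It is only an illustration of the rule as stated: how the T600 itself detects utterance boundaries is not described here, and the input format (a list of voiced intervals in seconds) and the function names are assumptions.

```python
PAUSE_LIMIT = 0.1    # seconds; a pause this long or longer splits an utterance
MIN_DURATION = 0.1   # seconds; shortest utterance mentioned in the text
MAX_DURATION = 2.0   # seconds; longest utterance mentioned in the text


def split_utterances(voiced):
    """Group voiced (start, end) intervals into utterances.

    Intervals separated by a gap shorter than PAUSE_LIMIT are merged
    into one utterance; a longer gap starts a new utterance.
    """
    utterances = []
    for start, end in sorted(voiced):
        if utterances and start - utterances[-1][1] < PAUSE_LIMIT:
            previous_start, previous_end = utterances[-1]
            utterances[-1] = (previous_start, max(previous_end, end))
        else:
            utterances.append((start, end))
    return utterances


def is_valid_length(utterance):
    """Check that an utterance lasts between 0.1 and 2 seconds."""
    start, end = utterance
    return MIN_DURATION <= end - start <= MAX_DURATION
```

For example, a phrase spoken with a 0.05 second gap between words is kept as one utterance, while a 0.3 second gap produces two.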

Each utterance processed by the T600 is passed through
nineteen  bandpass  filters  which  span  the  speech  spectrum. 
The  overall  signal  spectral  shape  is  then  described  using 
a  spectral  shape  detector  which  calculates  the  rate  of  change 
of  energy  level  with  respect  to  frequency.   The  spectral 
shape  and  its  changes  over  time  are  calculated  every  two 
milliseconds to determine the presence or absence of thirty-two
acoustic features. When the end of the utterance is
detected,  the  duration  of  the  utterance  is  divided  into 
sixteen  time  segments  and  reconstructed  into  a  normalized 
time base. The T600 extracts a 512-bit feature matrix -- 32
binary features by 16 time segments -- for each version of
an  utterance.   Then  all  matrices  (three,  five  or  ten)  are 
combined  to  produce  a  single  reference  matrix  for  an  element. 
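
The construction of the 512-bit matrix can be sketched in Python as follows. This is a simplified illustration, not the T600 algorithm: the text does not say how features are assigned to a time segment or how the training matrices are combined, so the rule used here (a feature is set in a segment when it is present in at least half of that segment's frames) and the per-bit majority vote are both assumptions, as are the function names.

```python
N_FEATURES = 32   # binary acoustic features per frame
N_SEGMENTS = 16   # normalized time segments per utterance


def normalize_time(frames):
    """Reduce a variable-length list of 32-feature frames to 16 segments.

    Each frame is a list of 32 zeros and ones (one frame per 2 ms sample).
    """
    if len(frames) < N_SEGMENTS:
        raise ValueError("need at least one frame per time segment")
    matrix = []
    for segment in range(N_SEGMENTS):
        lo = segment * len(frames) // N_SEGMENTS
        hi = (segment + 1) * len(frames) // N_SEGMENTS
        chunk = frames[lo:hi]
        # Assumed rule: set the bit if the feature appears in at least
        # half of the frames that fall in this segment.
        row = [
            1 if 2 * sum(frame[k] for frame in chunk) >= len(chunk) else 0
            for k in range(N_FEATURES)
        ]
        matrix.append(row)
    return matrix


def combine_passes(matrices):
    """Merge several 16 x 32 matrices by per-bit majority vote (assumed)."""
    count = len(matrices)
    return [
        [
            1 if 2 * sum(m[s][k] for m in matrices) > count else 0
            for k in range(N_FEATURES)
        ]
        for s in range(N_SEGMENTS)
    ]
```

With three, five or ten matrices passed to combine_passes, the result is a single 16 by 32 reference matrix of the kind described above.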

When  an  utterance  is  spoken  for  recognition  by  the  T600 
a  512-bit  descriptive  matrix  is  calculated  and  weighted 
correlations  between  this  matrix  and  each  reference  matrix 
describing  the  vocabulary  utterances  are  calculated.   The 
vocabulary utterance with the largest correlation exceeding some preset
threshold  value  is  then  selected  as  the  utterance  spoken. 
If  no  correlation  exceeds  the  preset  threshold  value  the 
T600  emits  a  "beep"  sound  [Ref.  11]. 
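
The selection rule can be sketched in Python. The weighting used by the T600 is not given in the text, so this sketch substitutes an unweighted fraction of agreeing bits for the weighted correlation. The threshold of 0.8 and the function names are hypothetical.

```python
def similarity(a, b):
    """Fraction of matching bits between two equal-shaped binary matrices."""
    pairs = [(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(x == y for x, y in pairs) / len(pairs)


def recognize(observed, references, threshold=0.8):
    """Return the best-matching vocabulary label, or None for a "beep".

    references maps each vocabulary label to its reference matrix.
    Only scores strictly above the threshold can be selected.
    """
    best_label, best_score = None, threshold
    for label, reference in references.items():
        score = similarity(observed, reference)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A return value of None corresponds to the case in which no reference pattern is close enough and the system would emit its "beep".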

The  T600  has  a  magnetic  tape  cartridge  unit  which  allows 
the  user  to  build  his  vocabulary  reference  patterns  and  store 
them  on  a  tape  cartridge.   When  the  subject  wants  to  use  the 
equipment, the tape is loaded into the preprocessor unit.
This  also  allows  a  user  to  build  a  vocabulary  for  different 
tasks.   He  can  then  load  the  voice  patterns  for  the  task 
he  needs  to  execute.   Since  the  operator  is  not  dependent 
on any large computer to store his voice patterns, the equipment
can easily be moved and still be operational.

D.   PROCEDURE 

At  the  beginning  of  the  session,  subjects  were  given  a 
questionnaire  regarding  their  opinions  on  voice  input  versus 
manual  typing.   (See  Appendix  A.)   The  objectives  of  the 
experiment  were  explained  along  with  an  introduction  to  the 
voice  recognition  equipment  used  and  the  procedure  to  be 
followed.   The  subject  was  then  seated  in  a  controlled 
acoustical  environment  chamber  in  front  of  a  video  display 
and  given  instructions  on  how  to  train  the  equipment.   (See 
Appendix  B.) 

The  vocabulary  used  in  this  test  consisted  of  fifty 
utterances  --  words  and  phrases  --  varying  in  length  from 
one  to  five  syllables.   The  utterances  were  not  chosen  to 
test  the  machine's  ability  to  distinguish  between  similar 
sounds -- "get" and "met," for example. The only consideration
in choosing the vocabulary was to have the same number
of  utterances  in  each  syllable  category  --  ten  one-syllable 
words,  ten  two-syllable  words,  etc.   The  vocabulary  list 
is  shown  in  Appendix  C.   Appendix  D  contains  the  Confusion 
Matrix. 


Once  the  subject  was  introduced  to  the  experiment  and 
equipment,  the  head  mike  was  mounted  and  the  subject  began 
training  the  fifty-word  vocabulary  using  either  three,  five 
or  ten  training  passes.   The  number  of  training  passes  used 
first  was  randomly  determined  so  that  each  would  be  used 
first the same number of times. That is, one-third of the
subjects  started  out  using  ten  training  passes.   Another  third 
used  three  training  passes  first  and  the  last  third  started 
out  using  five  training  passes. 
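
One way to produce such a balanced random assignment is sketched below in Python. The procedure actually used in the experiment is not described beyond the paragraph above, so the function name and the seeding are assumptions. Because forty subjects cannot be divided evenly by three, the sketch makes the split as even as possible (14, 13 and 13).

```python
import random


def assign_orders(subject_ids, conditions=(3, 5, 10), seed=None):
    """Give each subject a training order with a balanced first condition."""
    rng = random.Random(seed)
    # Cycle through the conditions so each appears as the first
    # condition nearly the same number of times, then shuffle.
    firsts = [conditions[i % len(conditions)] for i in range(len(subject_ids))]
    rng.shuffle(firsts)
    orders = {}
    for subject, first in zip(subject_ids, firsts):
        rest = [c for c in conditions if c != first]
        rng.shuffle(rest)   # remaining two conditions in random order
        orders[subject] = [first] + rest
    return orders
```

Calling assign_orders(list(range(40)), seed=1) returns a dictionary that maps each subject number to an order such as [5, 10, 3].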

The  training  procedure  involved  repeating  an  utterance 
the  required  number  of  times  and  then  testing  the  equipment 
by  repeating  the  utterance  two  or  three  times.   If  the 
machine  did  not  respond  correctly  two  out  of  three  times 
the  utterance  was  retrained.   Once  the  entire  vocabulary 
was  trained,  the  subject  tested  the  equipment  by  reading 
through the vocabulary list twice (100 utterances). Any
"beeps"  or  incorrect  responses  were  noted  by  the  experimenter. 
This  entire  procedure  was  repeated  using  a  different  number 
of  training  passes  until  each  subject  had  trained  and  tested 
the  equipment  using  three,  five  and  ten  training  repetitions. 
Subjects were allowed to rest, ask questions or get a drink
at any time during the procedure.
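
The retraining rule and the scoring of a test run can be expressed as a short Python sketch. The function names and the record format (pairs of expected and recognized utterance, with None standing for a "beep") are illustrative assumptions, not part of the original procedure.

```python
def needs_retraining(expected, responses):
    """True if fewer than two of the check responses were correct."""
    correct = sum(response == expected for response in responses)
    return correct < 2


def error_rate(test_log):
    """Fraction of test utterances that produced a beep or a wrong word.

    test_log is a list of (expected, response) pairs; a response of
    None represents a "beep" (no match found).
    """
    errors = sum(1 for expected, response in test_log if response != expected)
    return errors / len(test_log)
```

For a test run of 100 utterances with two beeps and one incorrect response, error_rate gives 0.03, that is, a 3% error rate.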

E.   DEPENDENT  VARIABLES 

After  the  training  session  each  subject  read  through  the 
list  of  words  two  times.   A  record  was  kept  of  each  time  the 
machine responded with a "beep" or an incorrect utterance.
A  record  was  also  kept  of  the  time  each  subject  took  to 
complete  the  experiment. 

III. ANALYSIS AND RESULTS

A.  HYPOTHESES 

The  following  hypotheses  were  to  be  tested: 

1.  Hypothesis  regarding  male  and  female  subjects. 

H0: "There is no difference between male and female
users of the voice recognition system."

H1: "The null hypothesis is false."

2. Hypothesis regarding officer and enlisted subjects.

H0: "There is no difference between officer and
enlisted users of the voice recognition system."

H1: "The null hypothesis is false."

3. Hypothesis regarding number of training passes.

H0: "There is no difference in recognition accuracy
when a different number of training passes is
used in the voice recognition system."

H1: "The null hypothesis is false."

B.  RESULTS  FOR  SEX 

The  results  of  this  experiment  for  male  and  female 
subjects  are  shown  graphically  in  Figure  3.   The  machine's 
performance for men was slightly better than for women -- 1.8%
error rate for men versus 2.1% for women based on twenty
subjects  making  6000  utterances  in  each  sex  category. 
However,  the  analysis  of  variance  (ANOVA)  results  in  Table  I 
show an F ratio of .45, which indicates no statistically significant
difference between male and female operators. Thus the null
hypothesis is not rejected. This result speaks highly
for the algorithm used by Threshold. It would appear they
have a good handle on the additional requirements needed
to process the female voice.

TABLE I
ANALYSIS OF VARIANCE

SOURCE                               SS    df       MS         F      P

Total                            3.1013   119

Between Subjects                 1.6172    39
  Male/Female                     .0199     1    .0199     .4584
  Enlisted/Officer                .0183     1    .0182     .4217
  Sex x Rank                      .0197     1    .0197     .4552
  Error (B)                      1.5594    36    .0433

Within Subjects                  1.4841    80
  Training Passes                 .2835     2    .1418    9.1427    .01
  Training Passes x Sex           .0330     2    .0165    1.0650
  Training Passes x Rank          .0197     2    .0983    6.3396    .01
  Training Passes x Sex x Rank    .0314     2    .0157    1.0129
  Error (W)                      1.1165    72    .0155

SS = sum of squares
df = degrees of freedom
MS = mean square
F  = F ratio
P  = probability of error

This  result  further  establishes  the  possibility  of  using 
a  voice  recognition  system  in  a  command  center  environment. 
The  highest  probability  of  error  occurred  with  female  subjects 
but  even  then  the  mean  percentage  error  was  only  2.1%.   That 
is,  out  of  one  hundred  utterances  (an  utterance,  again, 
being  a  single  word  or  group  of  words)  spoken  by  a  female 
watch  team  member  to  the  computer,  all  but  three  would  be 
interpreted  correctly.   If  these  utterances  were  being  typed, 
a  greater  probability  of  error  would  exist  since  one 
utterance  could  have  as  many  typing  errors  as  there  are 
characters  in  the  utterance. 
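
The F ratios in Table I can be cross-checked from the sums of squares, since each mean square is SS divided by df and each F ratio is a mean square divided by the appropriate error mean square. The Python sketch below does this with the values as they appear in Table I. It is a consistency check only, not the original computation. The value of 2 degrees of freedom for the training-pass rows is an assumption (three conditions minus one), and the function names are illustrative.

```python
# (source, SS, df, printed MS, error term) as read from Table I.
# The df of 2 for the training-pass rows is an assumption.
ROWS = [
    ("Male/Female", 0.0199, 1, 0.0199, "B"),
    ("Enlisted/Officer", 0.0183, 1, 0.0182, "B"),
    ("Sex x Rank", 0.0197, 1, 0.0197, "B"),
    ("Training Passes", 0.2835, 2, 0.1418, "W"),
    ("Training Passes x Sex", 0.0330, 2, 0.0165, "W"),
    ("Training Passes x Rank", 0.0197, 2, 0.0983, "W"),
    ("Training Passes x Sex x Rank", 0.0314, 2, 0.0157, "W"),
]

# Error mean squares: Error (B) for between-subject effects,
# Error (W) for within-subject effects.
ERROR_MS = {"B": 1.5594 / 36, "W": 1.1165 / 72}


def recompute(rows=ROWS, error_ms=ERROR_MS):
    """Recompute MS = SS / df and F = MS / MS_error for every row."""
    table = {}
    for source, ss, df, printed_ms, term in rows:
        ms = ss / df
        table[source] = {
            "MS": ms,
            "F": ms / error_ms[term],
            "printed_MS": printed_ms,
        }
    return table


def mismatches(table, tolerance=0.001):
    """List rows whose recomputed MS differs from the printed MS."""
    return [
        source
        for source, row in table.items()
        if abs(row["MS"] - row["printed_MS"]) > tolerance
    ]
```

With these inputs the recomputed F ratios for Male/Female (about .46) and Training Passes (about 9.14) agree with the table to rounding. The Training Passes x Rank row is the only one whose printed mean square (.0983) differs from SS divided by df (.0197 / 2); which of the two printed values is affected cannot be determined from the table alone.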

C.   RESULTS  FOR  RANK  --  OFFICER  VS.  ENLISTED 

Figure  4  shows  the  comparison  of  machine  errors  for  the 
two  categories  of  officer  and  enlisted.   The  machine's 
performance  for  the  enlisted  was  slightly  better  than  for 
officers -- 1.85% versus 2.05% mean error percentage based
on  twenty  subjects  making  6000  utterances  in  each  rank 
category. 

However,  the  statistical  results  from  the  ANOVA  (Table  I) 
show  an  F  ratio  of  .42.   Therefore,  there  is  no  significant 
statistical  difference  in  the  error  rate  of  the  T600  when 
used by officer or enlisted personnel. Based on these
statistics,  the  use  of  a  voice  system  should  be  favorable 
to  either  military  member  of  the  watch  team. 

D.   RESULTS  FOR  NUMBER  OF  TRAINING  PASSES  --  THREE,  FIVE 
OR  TEN 

Figure  5  shows  the  relationship  between  number  of 
training  passes  and  rank.   Figure  6  shows  the  relationship 
between  number  of  training  passes  and  sex.   In  each  case 
the  percentage  of  error  for  training  the  T600  with  five  or 
ten  training  passes  is  about  the  same  --  around  1%  error 
for  both  ranks  and  both  sexes.   However,  the  percentage 
of  error  using  three  training  passes  is  significantly 
higher -- around 2.1% based on rank and 2.4% to 3% based on
sex. 

This  graphical  interpretation  is  proven  statistically 
in  the  ANOVA  with  a  significance  level  of  .01.   That  is,  the 
F  ratio  is  9.14  which  is  well  above  the  4.79  required  for 
an  alpha  level  of  .01.   Based  on  the  F  ratio,  the  null 
hypothesis  is  rejected.   Therefore,  there  is  a  significant 
difference in recognition accuracy of the T600 when a different
number of training passes is used. A Duncan Range test
was  performed  to  verify  that  the  difference  in  performance 
was  between  three  training  passes  and  five  or  ten  training 
passes.   Five  and  ten  passes  had  about  the  same  probability 
of  error.   Even  though  three  training  passes  has  a 
significantly higher percentage of error over the five and
ten  passes,  it  is  still  only  a  3%  error  rate. 

The  ANOVA  also  showed  a  significant  interaction  (alpha 
level  less  than  .01)  between  the  number  of  training  passes 
used  and  the  rank  of  the  subject.   This  would  imply  that  an 
enlisted  user  would  have  a  lower  error  rate  if  he  trained 
the  system  using  five  training  passes  and  an  officer  user 
would  get  better  recognition  if  he  used  ten  training  passes. 
A  t-test  was  performed  to  determine  if  five  and  ten  passes 
for  officers  and  five  and  ten  passes  for  enlisted  were 
indeed  different  since  this  interaction  seemed  unrealistic. 
The  t-test  showed  both  t-statistics  (.7682  for  women  officers 
and  -1.3125  for  enlisted  women)  were  within  the  95%  acceptance 
region.   Therefore,  the  t-test  shows  there  is  no  difference 
in  error  rate  when  using  five  or  ten  training  passes  for 
either  officer  or  enlisted  category. 
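
A comparison of this kind can be sketched in Python as a paired t-test on per-subject error percentages. The text does not state which form of t-test was used, so the paired form is an assumption, and the data below are hypothetical values chosen only to show the calculation; they are NOT the data of this study.

```python
import math


def paired_t(first, second):
    """t statistic for paired samples (same subjects, two conditions)."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(variance / n)


# Hypothetical per-subject error percentages for ten subjects.
five_passes = [1.0, 2.0, 0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 1.0, 1.0]
ten_passes = [2.0, 1.0, 0.0, 1.0, 2.0, 2.0, 1.0, 1.0, 1.0, 0.0]

# Two-tailed 95% critical value of Student's t for 9 degrees of freedom.
T_CRITICAL_DF9 = 2.262

t_value = paired_t(five_passes, ten_passes)
within_acceptance_region = abs(t_value) < T_CRITICAL_DF9
```

When the absolute value of the statistic is below the critical value, as in this illustration, the hypothesis of no difference between the two conditions is not rejected.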

A  possible  explanation  for  enlisted  performance  being 
lower  with  ten  training  passes  is  that  five  passes  allowed 
enough  variation  to  build  a  good  identity  matrix  and  ten 
training  passes  invited  such  a  degree  of  boredom  that  the 
performance  was  degraded. 

It is interesting to note that although the manufacturer
recommends  ten  training  passes  for  the  best  performance  of 
the  system,  the  results  of  this  study  show  no  significant 
difference  between  five  and  ten  training  passes.   This 
result might only apply when a relatively small vocabulary
is  used  but  in  a  crisis  situation  this  could  suggest  the 
use  of  five  training  passes  to  get  a  needed  vocabulary  on 
tape  quickly.   As  one's  experience  with  the  T600  increases, 
the  use  of  fewer  training  passes  may  be  sufficient. 

The  order  in  which  subjects  trained  the  equipment  with 
the  different  number  of  training  passes  was  randomly  assigned 
to  prevent  any  biases  in  case  learning  or  fatigue  factors 
were  involved.   Figure  7  shows  the  percent  error  rate  versus 
number  of  training  passes  used  in  the  order  subjects  trained. 
That  is,  for  all  subjects  who  started  out  the  experiment 
using  three  training  passes,  the  percent  error  rate  was  2.3%. 
For  all  subjects  who  used  five  training  passes  first,  the 
percent  error  rate  was  2%.   Those  subjects  who  used  three 
training  passes  after  training  with  five  and  ten  passes 
had  a  percent  error  rate  of  2.9%. 

If improvement due to experience was a factor, then
five training passes was the only condition which demonstrated
this. However, the increase in errors when three training
passes were used second and third could be due to the fact
that subjects became accustomed to putting a lot of inflections
in the utterances and, when only three passes were used,
they ran out of training passes before running out of
inflections. The increase in errors when ten training passes
were used last could easily be explained by fatigue.
Most subjects took twice as long to train the fifty-word
vocabulary using ten training passes as they did using
three passes. By the time they were training and testing
for the third time, the novelty had begun to wear off and
voices were getting tired.

Correlations were run on three passes versus five passes,
five versus ten, and three versus ten to see whether a subject
who performed well with three training passes also performed
well with five and ten passes. Only the three-five correlation,
.67, is significant at the .05 level. The five-ten correlation
was .23 and the three-ten correlation was .11. Neither
of these is significantly different from zero and, therefore,
little correlation is evident for these two cases.
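
One common way to judge whether such correlations are significant is the t-test for a correlation coefficient, t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below applies it to the three reported values. It assumes that all 40 subjects entered each correlation and uses the two-sided .05 table value of Student's t for 38 degrees of freedom; the text does not say which test was actually used, so this is an illustration only.

```python
import math


def correlation_t_statistic(r, n):
    """t-statistic for testing whether a Pearson correlation r differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


N_SUBJECTS = 40     # assumption: all 40 subjects entered each correlation
T_CRITICAL = 2.024  # table value: two-sided .05 level, 38 degrees of freedom

# correlations as reported in the text
correlations = {"three-five": 0.67, "five-ten": 0.23, "three-ten": 0.11}
significant = {
    pair: abs(correlation_t_statistic(r, N_SUBJECTS)) > T_CRITICAL
    for pair, r in correlations.items()
}
print(significant)
```

Under these assumptions only the three-five correlation exceeds the critical value, which is consistent with the result stated above.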

E.   RESULTS  FOR  NUMBER  OF  UTTERANCE  SYLLABLES  --  1,  2,  3, 
4,  5 

Figures 8 through 10 show the recognition error rate
for each number of training passes versus the number of syllables
in the utterance. In Figure 8, using three training passes,
the T600 misinterpreted one-syllable utterances (words 0
through 4 and 25 through 29 in Appendix C) 28 times out of
800 utterances (40 subjects x 10 utterances x 2 repetitions
for each utterance) for a percentage error rate of 3.5%.
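
The arithmetic behind this percentage can be written out as a small sketch. The function and variable names are hypothetical; the counts are the ones given in the text.

```python
def percent_error(errors, subjects, utterances, repetitions):
    """Percentage of misrecognized utterances out of all test utterances."""
    total = subjects * utterances * repetitions
    return 100.0 * errors / total


# One-syllable utterances with three training passes (Figure 8)
one_syllable_three_passes = percent_error(
    errors=28, subjects=40, utterances=10, repetitions=2
)
print(one_syllable_three_passes)
```

The same function applies to every syllable category and training condition once the corresponding error count is known.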

With  one  exception  the  percentage  error  rate  decreased  as 
the  number  of  syllables  increased  for  all  three  training 
matrices. This seems reasonable since a greater number of
syllables gives the T600 more unique data to build a
recognition matrix for the utterance. The exception for both
three  and  five  passes  is  two  syllables.   That  is,  the 
percentage  error  rate  decreases  for  utterances  from  one  to 
five  syllables  with  the  exception  of  two  syllables  where 
the  error  rate  is  greatest.   In  the  case  of  ten  training 
passes,  the  exception  is  three-syllable  utterances,  with 
one  syllable  having  the  greatest  error  rate. 

The  percentage  error  rate  for  five  training  passes  is 
significantly  better  than  three  in  all  syllable  categories. 
With  the  exception  of  two  and  five  syllables  it  is  also 
better  than  ten  training  passes.   The  best  system  performance 
was obtained using five-syllable utterances and ten training passes.

IV. DISCUSSION AND CONCLUSIONS

The main points brought out in the preceding results
section are:

1. There was no significant difference in error rates
between officer and enlisted users of the
voice recognition system.

2. There was no significant difference in error rates
between female and male users of the system.

3.  There  was  a  significant  difference  in  error  rates 
of  all  categories  when  using  three  training  passes 
vice  five  or  ten  passes  but  the  five  and  ten  training 
passes  had  the  same  error  rates. 

4.  There  was  significant  interaction  between  rank 
and  the  number  of  training  passes  used. 

Based on these results there should be no problem,
technically or psychologically, with the use of voice
recognition systems by military men and women, officer
or enlisted. Although this experiment was conducted in
a sound reduction chamber, there are two T600 voice
recognition systems located in the C3 Laboratory at the Naval
Postgraduate School which are frequently in use. The C3
Laboratory simulates the environment of a command center.
There have been no problems with background noise in the
use of this voice system. Professor R. Elster [Ref. 12]
found similar results with his study on The Effects of
Certain Background Noises on the Performance of a Voice
Recognition System.

The enthusiasm and ease with which the subjects used
and trained the equipment are positive signs for the
successful use of voice recognition systems in command centers.
At the time of this writing, a T600 system has been placed
in the command center at Commander in Chief Pacific Fleet
(CINCPACFLT). During the week of 1 December 1980, Dr. Gary
Poock and LT Ellen Roland of the Naval Postgraduate School
faculty gave a demonstration of the T600 voice recognition
system to CINCPACFLT. That staff now has a T600 in the
command center which is being experimented with in a variety
of areas.

APPENDIX A
SUBJECT  QUESTIONNAIRE  AND  ANSWER  SHEET 

Please answer the following questions with respect to
your capabilities.

For items 3-7, designate your feelings from strong
feeling for manual input (far left box), no strong feeling
either way (middle box), strong feeling for voice input
(far right box).

For items 8 and 9, designate your feelings from strong
feelings in favor (far right box), no strong feelings either
way (middle box), strong feeling against (far left box).

1.  Have  you  ever  used  voice  input? 

2.  Have  you  ever  seen  voice  input  used? 

3.  Which  might  be  easier,  manual  typing  input  or  voice 
input  for  communicating  with  a  computer? 

4.  Would  you  be  more  relaxed  using  manual  typing  input 
or  voice  input? 

5.  Would  you  have  more  flexibility  in  entering  items  to  a 
computer  with  voice  input  or  manual  typing  input? 

6.  Would  voice  input  or  manual  typing  allow  you  more  time 
and  freedom  to  do  other  things? 

7.  Would  you  be  more  frustrated  using  voice  input  or 
manual  typing? 

8. In general, do you like the idea of voice input?

9. In general, do you think you would like to use voice
input in everyday tasks yourself if it were applicable?

APPENDIX B
INSTRUCTIONS  TO  SUBJECTS 

The fifty-word vocabulary being used with the voice
recognizer in the experiment is attached to these instructions.
You will be required to repeat each word of this
vocabulary three, five and ten times to train the recognizer
to recognize your particular patterns of each word. To
facilitate  recognition  by  the  voice  recognizer,  you  should 
include  in  the  repetitions  as  many  as  possible  of  the 
different  ways  you  might  say  the  word  in  normal  speech;  for 
example,  use  different  intonations  and  emphasis,  and  small 
variations  in  volume. 

In  order  to  keep  track  of  the  number  of  times  you 
say  each  word  when  using  ten  repetitions  and  to  reduce 
breath  noise,  it  is  best  to  speak  the  ten  repetitions  in 
several  groups.   For  example,  if  the  word  is  zero,  it  is 
better to group them as:

000  -  000  -  0000 
or 

000  -  000  -  000  -  0 
rather  than 

0000000000. 

Please observe the following guidelines while inputting
voice data to the recognizer.

-- Speak each word crisply and quickly but do not
overpronounce.

-- Leave a distinct pause (specifically, at least one-tenth
of a second of silence) between each word so
that  the  recognizer  can  distinguish  the  end  of  one 
word  from  the  beginning  of  the  next.   Do  not  leave 
a  period  of  silence  within  a  word  or  the  recognizer 
will  mistake  it  for  two  separate  words. 

--  Avoid  breathing  into  the  microphone  at  the  end  of 
words  as  this  will  generate  false  inputs  to  the 